Skip to content

Add BRIN support for spoint and sbox #55

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Aug 27, 2023
Merged

Conversation

vitcpp
Copy link
Contributor

@vitcpp vitcpp commented Aug 16, 2023

This is the adaptation of the patch introduced in akorotkov/pgsphere#8 by @gbroccolo with some small fixes. I appreciate any help with reviewing and testing of this patch.

@vitcpp vitcpp changed the title Brin support Add BRIN support for spoint and sbox Aug 16, 2023
@vitcpp vitcpp changed the title Add BRIN support for spoint and sbox Add BRIN support for spoint and sbox (#52) Aug 16, 2023
@vitcpp vitcpp changed the title Add BRIN support for spoint and sbox (#52) Add BRIN support for spoint and sbox (issue #52) Aug 16, 2023
@vitcpp vitcpp changed the title Add BRIN support for spoint and sbox (issue #52) Add BRIN support for spoint and sbox Aug 16, 2023
@vitcpp vitcpp linked an issue Aug 16, 2023 that may be closed by this pull request
Copy link
Contributor

@esabol esabol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor tweaks, but otherwise looks good to me.

It would be nice if BRINs supported scircles and the other s-types as well, but maybe in a follow-up PR if there's interest.

README.pg_sphere Outdated
@@ -11,6 +11,7 @@ It provides:

This is an R-Tree implementation using GiST for spherical objects like
spherical points and spherical circles with useful functions and operators.
It also support the Block Range INdexing (BRIN) for large datasets.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change "support" to "supports".

doc/indices.sgm Outdated
of data blocks (<literal>pages</literal>) on physical storage in order to
organize data searches on ranges of summarized data that can be easily skipped
on the base of search filters (see <ulink
url="https://www.postgresql.org/docs/9.5/static/brin-intro.html"><citetitle>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc/indices.sgm Outdated
of pages are specified, the higher granularity is reached during
the searches, and performance's gap between GiST indexes and BRINs
is lower (consider that BRINs size increases as well). Different
summarizations can be used with the following command:
Copy link
Contributor

@esabol esabol Aug 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole paragraph (lines 93-97) could be better phrased. Here is what I recommend:

By default, BRINs summarize blocks of 128 pages. The smaller the number of pages specified, the higher the granularity in searches, and the gap in performance between GiST indexes and BRINs will be decreased. (Note that the size of the BRINs increases as well.) Different summarizations can be specified with the following command:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does make sense. I was thinking if it makes sense as well to add a reference to the fact that data in the pages should have some kind of "natural clustering" in the pages themselves: since the Block Range INdex keep the summarised information about the MBR that includes all the geometries in single page, then in case the geometries are widely distributed in the pages it will result for most of them an MBR which is pretty identical to the MBR of the whole dataset, making the index useless. This might happen for example when the geometries are imported in bulk in the DB through parallel importer.

In this case, the solution is to use a GiST index to CLUSTER the table at the end of the bulk import, operation that can take some time for very large DBs but it needs to be done only once.

For future inserts, autovacuum takes care for regular maintenance of the index with unsummarized pages. Further details here: https://www.postgresql.org/docs/current/brin-intro.html

Copy link
Contributor Author

@vitcpp vitcpp Aug 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@esabol Not sure, the phrase "(Note that the size of the BRINs increases as well.) " requires the braces. May be, just remove braces? It will be a regular sentense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you mean parentheses, not braces. They can be removed. I only included them because the original text had them.

Copy link

@gbroccolo gbroccolo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I basically agree the rest of the comments from @esabol. I added a single one regarding linking users to further documentation about the maintenance of the index, which was missing in the original PR.

doc/indices.sgm Outdated
of pages are specified, the higher granularity is reached during
the searches, and performance's gap between GiST indexes and BRINs
is lower (consider that BRINs size increases as well). Different
summarizations can be used with the following command:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does make sense. I was thinking if it makes sense as well to add a reference to the fact that data in the pages should have some kind of "natural clustering" in the pages themselves: since the Block Range INdex keep the summarised information about the MBR that includes all the geometries in single page, then in case the geometries are widely distributed in the pages it will result for most of them an MBR which is pretty identical to the MBR of the whole dataset, making the index useless. This might happen for example when the geometries are imported in bulk in the DB through parallel importer.

In this case, the solution is to use a GiST index to CLUSTER the table at the end of the bulk import, operation that can take some time for very large DBs but it needs to be done only once.

For future inserts, autovacuum takes care for regular maintenance of the index with unsummarized pages. Further details here: https://www.postgresql.org/docs/current/brin-intro.html

@gbroccolo
Copy link

Some minor tweaks, but otherwise looks good to me.

It would be nice if BRINs supported scircles and the other s-types as well, but maybe in a follow-up PR if there's interest.

Yes, it is possible to add the support for the other s-types, it's just enough to use the conversion to the 3D-MBR of the spherical geometry defined in src/key.c.

But we should add all the BRIN support functions for each type, so maybe it's better to create single PR for each of it.

Also (and this can be considered an additional note for this PR) it might be worth to create different source files for each data type in order to avoid that src/brin.c could end in thousands of line of code in a single file.

@vitcpp
Copy link
Contributor Author

vitcpp commented Aug 17, 2023

It would be nice if BRINs supported scircles and the other s-types as well, but maybe in a follow-up PR if there's interest.

I agree, lets do it as a separate task.

@vitcpp
Copy link
Contributor Author

vitcpp commented Aug 17, 2023

Some minor tweaks, but otherwise looks good to me.
It would be nice if BRINs supported scircles and the other s-types as well, but maybe in a follow-up PR if there's interest.

Yes, it is possible to add the support for the other s-types, it's just enough to use the conversion to the 3D-MBR of the spherical geometry defined in src/key.c.

But we should add all the BRIN support functions for each type, so maybe it's better to create single PR for each of it.

Also (and this can be considered an additional note for this PR) it might be worth to create different source files for each data type in order to avoid that src/brin.c could end in thousands of line of code in a single file.

I like the idea to add support in different PR. Lets compete and deliver this PR. I also like the idea to create different source files for each data type. I would prefer to have a big number of small files with clear names, than to have one file with a lot of code. I propose to keep this PR unchanged and make such work in the future PRs.

@vitcpp
Copy link
Contributor Author

vitcpp commented Aug 17, 2023

@esabol, @gbroccolo I did some fixes. Thank you! I also rebased the branch. I removed the unnecessary braces in the original sentense: (Note that the size of the BRINs increases as well.). Please, let me know if I did wrong. I will rollback it.

Copy link
Contributor

@esabol esabol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more tweak to the documentation, please.

doc/indices.sgm Outdated
time of operators <link
and Block Range INdexing (<literal>BRIN</literal>) algorithms to create
spherical indices.
<literal>GiST</literal> index represents the R-Tree implementation for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change "index represents the" to "indexes utilize an" on this line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, I will change it. What do you think if to change 'R-Tree' to 'R-tree'?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think if to change 'R-Tree' to 'R-tree'?

Yes, good idea!

README.pg_sphere Outdated
@@ -11,7 +11,7 @@ It provides:

This is an R-Tree implementation using GiST for spherical objects like
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to change R-Tree to R-tree.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please.

doc/indices.sgm Outdated
the searches, and performance's gap between GiST indexes and BRINs
is lower (consider that BRINs size increases as well). Different
summarizations can be used with the following command:
By default, BRINs summarize blocks of 128 pages. The smaller the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also propose to change BRINs to 'BRIN indexes' !?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the PostgreSQL documentation uses "BRIN indexes", so I think that would be correct.

Copy link
Contributor

@esabol esabol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I found some more things in the documentation that should be fixed, I think.

<literal>GiST</literal> indexes utilize an R-tree implementation for
spherical objects, while <literal>BRIN</literal> indexes are based on "summarization"
of data blocks (<literal>pages</literal>) on physical storage in order to
organize data searches on ranges of summarized data that can be easily skipped
Copy link
Contributor

@esabol esabol Aug 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either change "data blocks" to "data block" or add an apostrophe after "data blocks" to make it possessive. "data block pages" or "data blocks' pages".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@esabol The 'pages' word is written in the brackets. It is how the text is shown in the generated pdf: "summarization" of data blocks (pages) on physical storage...

I think the word pages is used here as the synonym of data blocks. Not sure, we should make this change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I didn't see the parentheses. Do not change this.

doc/indices.sgm Outdated
on the base of search filters (see <ulink
url="https://www.postgresql.org/docs/current/brin-intro.html"><citetitle>
PostgreSQL documentation</citetitle></ulink> for further details on BRIN indexes).
As a consequence, BRIN indexes result to be really small indexes (up to 1000 times smaller
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change "result to be really" to "are very".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, I understand the final sentence. I would like to change this sencence to something like this: BRIN indexes consume significantly less physical storage than GiST indexes.

Copy link
Contributor

@esabol esabol Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to change this sencence to something like this: BRIN indexes consume significantly less physical storage than GiST indexes.

Fine. 👍

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vitcpp it's not only related to less used storage. It's more because indexes are more efficiently used as far as more 8kB pages of the index are in-memory kept: the more RAM you have available to host the shared buffers in PostgreSQL, the more an index can be used for large amount of data, even if the resulting index is big. The good point of using BRINs is because they fits easily in the shared buffers even in case of large data stored in relatively small machines.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But regarding the correction: it's fine, and it intrinsecally includes what I said above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be to simplify the sentense like this:
As a consequence, BRIN indexes are small (up to 1000 times smaller ...)?

I've prepared the commit with such change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be to simplify the sentense like this: As a consequence, BRIN indexes are small (up to 1000 times smaller ...)?

I've prepared the commit with such change.

No, that's not proper English. Either go with what I wrote in my original comment or what you wrote in the second comment in this thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry. I've changed as you proposed. Thank you.

doc/indices.sgm Outdated
and Block Range INdexing (<literal>BRIN</literal>) algorithms to create
spherical indices.
<literal>GiST</literal> indexes utilize an R-tree implementation for
spherical objects, while <literal>BRIN</literal> indexes are based on "summarization"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add "the" just before "summarization" and re-word-wrap the whole paragraph if lines are longer than 79 characters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@esabol I propose not to wrap paragraphs right now because some other unchanged paragraphs do not follow this rule. It will produce more diffs to make the review process more difficult. I see that postgresql doc try to follow this rule but with tag identation is 1 characher. I have some plans to discuss with my colleagues existing best practices and adopted rules in postgresql community to create docs and to reformat pgSphere docs to follow these rules. In this task we can fix long lines.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, just add the "the" then.

@gbroccolo
Copy link

@vitcpp for me the PR is fine and can be merged by any committer of this repo at any moment.

@vitcpp
Copy link
Contributor Author

vitcpp commented Aug 21, 2023

I propose to add a test for sbox BRIN as well. I've added a test for sbox BRIN index but I do not see that the index was selected by the planner. Sequential scan is still used. I just copied the test for spoint with some trivial changes. I haven't digged deeply how the test works. May be I missed something.

@esabol
Copy link
Contributor

esabol commented Aug 21, 2023

I've added a test for sbox BRIN index but I do not see that the index was selected by the planner. Sequential scan is still used. I just copied the test for spoint with some trivial changes. I haven't digged deeply how the test works. May be I missed something.

I think the table needs to be clustered on the parameter on which you have created the BRIN index.

@vitcpp
Copy link
Contributor Author

vitcpp commented Aug 22, 2023

@esabol I think, BRIN index should be created without clustering but I'm not sure about it. BRIN index is not used in the test brin_sbox. Seq scan is used instead. I still can't achieve it in the test. We may marge PR as it is now but later sbox_brin should be improved.

@esabol
Copy link
Contributor

esabol commented Aug 22, 2023

The few examples of BRIN usage I can find on the Internet all cluster the table. 🤷

Apparently, you can SET LOCAL enable_seqscan = OFF; to force the BRIN index to be used?

https://postgrespro.com/list/thread-id/2638982

@esabol
Copy link
Contributor

esabol commented Aug 22, 2023

And the StackOverflow answers here all say that "the BRIN index key has to correspond to the physical ordering of the data," which is what clustering achieves.

https://stackoverflow.com/questions/72614493/brin-index-usage

@vitcpp
Copy link
Contributor Author

vitcpp commented Aug 22, 2023

@esabol Yes, you are right in general. BRIN index follows the physical ordering of the data. But BRIN index is created after the table is populated with some data. If you think to use CLUSTER command, it is not the case. This command requires already created index and it clusters the data in the table in according with the already created index.

As I know, the basic idea of BRIN index is to separate the whole table into a number of ranges. Each range may contain one or more pages (blocks). BRIN index contains some summary of each range: min/max values or bounding rectangle of the geometrical data, stored in the pages of the range. Such separation allows to select ranges for seq scan or ignore it very fast. It is why bitmap scan is used with BRIN index. Such indexing works well then table data are already sorted well because BRIN index operates by pages (blocks).

I guess, there are two possible sources of the problem:

  • data is too small - optimizer selects seq scan unconditionally
  • the definition of sbox BRIN index is not complete

@esabol
Copy link
Contributor

esabol commented Aug 22, 2023

@esabol Yes, you are right in general. BRIN index follows the physical ordering of the data. But BRIN index is created after the table is populated with some data. If you think to use CLUSTER command, it is not the case. This command requires already created index and it clusters the data in the table in according with the already created index.

Correct, but you can drop the index that you use to cluster the table with after the clustering is done, like this:

CREATE INDEX idx_sometable_someparam ON sometable(someparam);
CLUSTER sometable USING idx_sometable_someparam;
DROP INDEX idx_sometable_someparam;
CREATE INDEX idxbrin_sometable_someparam ON sometable USING brin(someparam);

If sometable is already ordered when loaded into the database, then you don't need to do this, but lots of BRIN examples do exactly this. Some examples also suggest VACUUM-ing and ANALYZE-ing the table as well.

Did you try SET LOCAL enable_seqscan = OFF;?

@vitcpp
Copy link
Contributor Author

vitcpp commented Aug 23, 2023

@esabol I did what you proposed. I clustered the table first then I created the BRIN index. I also increased the amount of the data in the table. I duplicated the original set of rows many times. The same result - I didn't see that BRIN index was applied.

@gbroccolo
Copy link

@esabol I did what you proposed. I clustered the table first then I created the BRIN index. I also increased the amount of the data in the table. I duplicated the original set of rows many times. The same result - I didn't see that BRIN index was applied.

Hi @esabol and @vitcpp, sorry for my late reply here, but it's kind of busy period for me in these days.

Regarding your test: you are both right saying that BRINs perform well at the moment that the data is phisically organised in the 8kB pages in order to avoid to have block of pages would correspond to similar MBRs including their entries, as I tried to explain here. Note also that CLUSTER on preexisting indexes is not the only way to achieve this: also VACUUM and autovacuum bring to the same result, because both of them trigger pages' resumming as explained here.

However, this is not going to solve the problem: here BRINs are not used simply because they are not convenient (in terms of I/O and access to memory operations) from PostgreSQL point of view: for small datasets, inspecting the index and select the single blocks of pages to be then sequentially scanned (what is called bitmapscan in the planner) is generally more expensive than a direct sequential scan of all pages of a table (seqscan). It is something that generally happens also for GiST indexes, where for small datasets an indexscan is more expensive than a seqscan.

Now, we have to think about what we want to achieve here: for a regression test is not required to test when the planner choose a plan rather than another (which is something that depends on a lot of factors: underlying hardware, size of the data, nature of the data, configuration of the PostgreSQL engine, ...), but to check that the result obtained through the index would be identical to the one obtained by the sequential scan. This is what I tried to achieve for example in this test I wrote for spoint: I filled a test table with 77 spherical point (more or less close to each other, just to have a good summarization as starting point, but it's not necessary for the purpose of the test) then I was checking if the result with a sequential scan is identical to the one through a bitmap scan.

Hope I helped with this long comment.

@vitcpp
Copy link
Contributor Author

vitcpp commented Aug 24, 2023

@esabol , @gbroccolo It seems I've found and fixed the problem with sbox support for BRIN index. Some operators in brin_inclusion_spheric_ops were absent (please, see my last commit). Now the BRIN index seems to be chosen by the planner.

@esabol
Copy link
Contributor

esabol commented Aug 24, 2023

It seems I've found and fixed the problem with sbox support for BRIN index.

Fantastic! Great work, @vitcpp !

@vitcpp vitcpp force-pushed the brin-support branch 3 times, most recently from 999173a to a6f9cba Compare August 25, 2023 08:27
@vitcpp
Copy link
Contributor Author

vitcpp commented Aug 25, 2023

@esabol , @gbroccolo I've prepared the PR for merge. I squashed my commits into one. I did no more changes except of adding sbox_brin to the list of tests of installcheck rule. I think the PR is ready for merge. If everything is ok, I will merge. Thank you very much for your help!

Copy link
Contributor

@esabol esabol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@vitcpp
Copy link
Contributor Author

vitcpp commented Aug 25, 2023

Not sure, what hapenned, but moc100 test is failed now unfortunately.

@esabol
Copy link
Contributor

esabol commented Aug 25, 2023

Not sure, what happened, but moc100 test is failed now unfortunately.

Did the PostgreSQL version used in the PG16 build change? 16 beta 3 was released on August 10. When was the last time the PG16 build passed? Maybe it took 2 weeks before 16 beta 3 showed up in Travis CI? If the only reason for the failure is going from 16 beta 2 to beta 3, then this change in the output might need to be reported to the PostgreSQL dev team.

My first thought was that it was due to a side effect of the SQL in the BRIN test, but that can't be since PR #56 also failed its PG16 build.

If this is an intentional change by the PostgreSQL developers in PostgreSQL 16, then we're going to need different expected outputs for pre-16 and 16+. (How annoying!)

@vitcpp
Copy link
Contributor Author

vitcpp commented Aug 25, 2023

@esabol I did some changes in the PR 2 days ago and PG16 tests were passed ok. My last change is to squash my commits. I guess, something is changed on travis-ci. Furthermore, my last PR #56 is failed as well. I will try to fix the differences in the output - we do not need to validate the behaviour of postgresql here. If we fix the current behaviour we can see some changes in the future.

@vitcpp
Copy link
Contributor Author

vitcpp commented Aug 25, 2023

@esabol I adjusted expected/moc100_4.out file which fixed the issue. I'm going to do this change in a separate PR. I propose to fix moc100 first, then rebase and merge this PR.

Giuseppe Broccolo and others added 2 commits August 26, 2023 09:21
Adjust style using pgindent, execute regression test for BRIN code just for PG>9.5
Move brin.c(h) files into src subdirectory

Add brin support in upgrade script

Fix some code problems

Fix test_init

Fix Indices section in the doc

Add test for sbox BRIN index

Fix found problems in the doc after review

Fix BRIN support for sbox
Copy link

@gbroccolo gbroccolo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now. Thanks @vitcpp.

@vitcpp vitcpp merged commit c79fccc into postgrespro:master Aug 27, 2023
@vitcpp vitcpp deleted the brin-support branch August 27, 2023 08:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add BRIN support for spoint and sbox?
3 participants